11.2.1 Exercises

  1. What happens if you repeat the above analysis with all diamonds? (Not just all diamonds with two or fewer carats). What does the strange geometry of log(carat) vs relative price represent? What does the diagonal line without any points represent?

Consider the linear relationship between lcarat and lprice:

diamonds2 <- diamonds %>% 
  mutate(
    lcarat = log2(carat),
    lprice = log2(price)
  )
diamonds2
## # A tibble: 53,940 x 12
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows, and 2 more variables: lcarat <dbl>,
## #   lprice <dbl>
ggplot(diamonds2, aes(lcarat,lprice)) +
  geom_bin2d() +
  geom_smooth(method="lm",se=FALSE,size=2,color="yellow")

We can see above that the x-axis now extends further to the right than in the original plot (which was limited to diamonds of two or fewer carats), while the range of lprice barely increases.

mod <- lm(lprice ~ lcarat, data=diamonds2)
diamonds2 <- diamonds2 %>% mutate(rel_price = resid(mod))
diamonds2
## # A tibble: 53,940 x 13
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 53,930 more rows, and 3 more variables: lcarat <dbl>,
## #   lprice <dbl>, rel_price <dbl>
ggplot(diamonds2,aes(carat,rel_price)) + geom_bin2d()

The plot above reveals a problem: there is a distinct pattern in the residuals. The linear model overpredicts log price for large diamonds, and the variance of the residuals is noticeably higher for small carats. Looking back at the plot of the linear model, the fitted line extends above the observed range of lprice at the right edge, so only points below the line remain there, which is why the residuals for large carats are negative.
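The curvature suggests a single straight line is not quite right across the full carat range. One possible remedy, sketched here under the assumption that diamonds2 (with lcarat and lprice) is available from above, is to allow a quadratic term:

```r
library(ggplot2)
library(dplyr)

# A quadratic in lcarat can absorb the curvature at the high end
# (a sketch, not necessarily the book's approach):
mod2 <- lm(lprice ~ poly(lcarat, 2), data = diamonds2)

diamonds2 %>%
  mutate(rel_price2 = resid(mod2)) %>%
  ggplot(aes(carat, rel_price2)) +
  geom_bin2d()
```

If the systematic overprediction disappears in this plot, the pattern in the residuals was indeed driven by curvature rather than noise.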

color_cut <- diamonds2 %>%
  group_by(color, cut) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
color_cut
## # A tibble: 35 x 4
## # Groups:   color [?]
##    color       cut    price   rel_price
##    <ord>     <ord>    <dbl>       <dbl>
##  1     D      Fair 4291.061 -0.06695865
##  2     D      Good 3405.382 -0.03902942
##  3     D Very Good 3470.467  0.10763401
##  4     D   Premium 3631.293  0.11368869
##  5     D     Ideal 2629.095  0.21605978
##  6     E      Fair 3682.312 -0.16154694
##  7     E      Good 3423.644 -0.04676044
##  8     E Very Good 3214.652  0.06901578
##  9     E   Premium 3538.914  0.08742779
## 10     E     Ideal 2597.550  0.17323162
## # ... with 25 more rows
ggplot(color_cut, aes(color, price)) +
  geom_line(aes(group=cut), color="grey80") + 
  geom_point(aes(color = cut))

Above we see the same trend as in the book: the lowest-quality diamonds (color = J) have the highest average price, especially for Premium-cut diamonds. Plotting relative price against color resolves the issue.

ggplot(color_cut, aes(color, rel_price)) +
  geom_line(aes(group = cut), color = "grey80") + 
  geom_point(aes(color=cut))

  2. I made an unsupported assertion that lower-quality diamonds tend to be larger. Support my claim with a plot.
ggplot(diamonds, aes(color, carat)) + geom_boxplot()
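The same check extends to the other two quality measures; if the claim holds, lower grades of clarity and cut should also hold larger diamonds:

```r
library(ggplot2)

# Boxplots of carat for each clarity and cut grade:
ggplot(diamonds, aes(clarity, carat)) + geom_boxplot()
ggplot(diamonds, aes(cut, carat)) + geom_boxplot()
```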

  3. Can you create a plot that simultaneously shows the effect of color, cut, and clarity on relative price? If there’s too much information to show on one plot, think about how you might create a sequence of plots to convey the same message.
color_cut_clarity <- diamonds2 %>%
  group_by(color, cut, clarity) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
ggplot(color_cut_clarity, aes(color,rel_price)) + 
  geom_line(aes(group=cut), color="grey80") + 
  geom_point(aes(color=cut)) + 
  facet_wrap(~clarity)

  4. How do depth and table relate to the relative price?

Table and depth do not show much relationship with relative price. This is plausible: within their usual ranges they carry little extra information about price once the effect of carat has been removed.

ggplot(diamonds2,aes(depth,rel_price)) +
  geom_bin2d()

ggplot(diamonds2,aes(table,rel_price)) +
  geom_bin2d()
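A quick numeric check of how strongly depth and table actually track carat (a sketch; low values would mean their weak relationship with rel_price is not simply explained by collinearity with carat):

```r
library(ggplot2)  # for the diamonds data

# Correlation of each proportion measurement with carat:
cor(diamonds$carat, diamonds$depth)
cor(diamonds$carat, diamonds$table)
```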

11.3.1 Exercises

  1. The final plot shows a lot of short-term noise in the overall trend. How could you smooth this further to focus on long-term changes?

Instead of using geom_line, we can draw a loess curve using geom_smooth:

txhousing
## # A tibble: 8,602 x 9
##       city  year month sales   volume median listings inventory     date
##      <chr> <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl>    <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000.000
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.083
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.167
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.250
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.333
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.417
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.500
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2000.583
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2000.667
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2000.750
## # ... with 8,592 more rows
#De-season function: remove the monthly means by regressing on month
deseas <- function(x, month) {
  resid(lm(x ~ factor(month), na.action = na.exclude))
}

txhousing <- txhousing %>%
  group_by(city) %>%
  mutate(rel_sales = deseas(log(sales), month))

ggplot(txhousing, aes(date,rel_sales)) + 
  geom_line(aes(group=city), alpha = 1/5) +
  geom_smooth(method="loess",se=FALSE)
## Warning: Removed 568 rows containing non-finite values (stat_smooth).
## Warning: Removed 430 rows containing missing values (geom_path).
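geom_smooth() forwards extra arguments to the smoother, so the loess span can be raised for an even smoother long-term trend (a sketch, assuming the rel_sales column created above; a span close to 1 uses most of the series for each local fit):

```r
library(ggplot2)

# Heavier smoothing via a larger loess span:
ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_smooth(method = "loess", span = 1, se = FALSE)
```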

  2. If you look closely (e.g. + xlim(2008,2012)) at the long-term trend you’ll notice a weird pattern in 2009-2011. It looks like there was a big dip in 2010. Is this dip “real”? (i.e. can you spot it in the original data)

Looking at the relative sales (after removing the month effect), a number of cities do indeed show a dip in 2010:

ggplot(txhousing, aes(date,rel_sales)) + 
  geom_line(aes(group=city), alpha = 1/5) +
  geom_line(stat = "summary", fun.y="mean", color="red") +
  xlim(2008,2012)
## Warning: Removed 6377 rows containing non-finite values (stat_summary).
## Warning: Removed 6376 rows containing missing values (geom_path).

  3. What other variables in the TX housing data show strong seasonal effects? Does this technique help to remove them?
summary(txhousing)
##      city                year          month            sales       
##  Length:8602        Min.   :2000   Min.   : 1.000   Min.   :   6.0  
##  Class :character   1st Qu.:2003   1st Qu.: 3.000   1st Qu.:  86.0  
##  Mode  :character   Median :2007   Median : 6.000   Median : 169.0  
##                     Mean   :2007   Mean   : 6.406   Mean   : 549.6  
##                     3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.: 467.0  
##                     Max.   :2015   Max.   :12.000   Max.   :8945.0  
##                                                     NA's   :568     
##      volume              median          listings       inventory     
##  Min.   :8.350e+05   Min.   : 50000   Min.   :    0   Min.   : 0.000  
##  1st Qu.:1.084e+07   1st Qu.:100000   1st Qu.:  682   1st Qu.: 4.900  
##  Median :2.299e+07   Median :123800   Median : 1283   Median : 6.200  
##  Mean   :1.069e+08   Mean   :128131   Mean   : 3217   Mean   : 7.175  
##  3rd Qu.:7.512e+07   3rd Qu.:150000   3rd Qu.: 2954   3rd Qu.: 8.150  
##  Max.   :2.568e+09   Max.   :304200   Max.   :43107   Max.   :55.900  
##  NA's   :568         NA's   :616      NA's   :1424    NA's   :1467    
##       date        rel_sales      
##  Min.   :2000   Min.   :-1.8473  
##  1st Qu.:2004   1st Qu.:-0.1404  
##  Median :2008   Median : 0.0117  
##  Mean   :2008   Mean   : 0.0000  
##  3rd Qu.:2012   3rd Qu.: 0.1531  
##  Max.   :2016   Max.   : 0.9080  
##                 NA's   :568
#"Months inventory": amount of time it would take to sell all current listings at current pace of sales. 
ggplot(txhousing, aes(date,log(inventory))) + 
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 603 rows containing missing values (geom_path).

#Total active listings
ggplot(txhousing, aes(date,log(listings))) +
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 518 rows containing missing values (geom_path).

#Total value of sales
ggplot(txhousing, aes(date,log(volume))) +
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 430 rows containing missing values (geom_path).

#Median sale price:
ggplot(txhousing, aes(date,log(median))) +
  geom_line(aes(group=city), alpha=1/2)
## Warning: Removed 446 rows containing missing values (geom_path).

  4. Not all the cities in this data set have complete time series. Use your dplyr skills to figure out how much data each city is missing. Display the results with a visualization.
numDates <- txhousing %>% 
  summarize(nDates=n_distinct(date)) %>% 
  select(nDates) %>% 
  max()
numDates
## [1] 187
cityCnts <- txhousing %>%
  na.omit() %>%
  group_by(city) %>%
  summarize(nComplete=n()) %>%
  mutate(pctComplete=nComplete/numDates)
cityCnts
## # A tibble: 46 x 3
##                     city nComplete pctComplete
##                    <chr>     <int>       <dbl>
##  1               Abilene       186   0.9946524
##  2              Amarillo       182   0.9732620
##  3             Arlington       186   0.9946524
##  4                Austin       187   1.0000000
##  5              Bay Area       186   0.9946524
##  6              Beaumont       187   1.0000000
##  7       Brazoria County       129   0.6898396
##  8           Brownsville        82   0.4385027
##  9 Bryan-College Station       187   1.0000000
## 10         Collin County       186   0.9946524
## # ... with 36 more rows
ggplot(cityCnts, aes(city,pctComplete)) + 
  geom_col() +
  labs(x="City", y="% Complete") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
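Note that na.omit() drops rows with an NA in any column, not just sales. A sketch that counts missing sales values specifically:

```r
library(ggplot2)  # txhousing data
library(dplyr)

# Count NA sales per city and plot, most-missing first:
missing_by_city <- txhousing %>%
  group_by(city) %>%
  summarize(n_missing = sum(is.na(sales))) %>%
  arrange(desc(n_missing))

ggplot(missing_by_city, aes(n_missing, reorder(city, n_missing))) +
  geom_point()
```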

  5. Replicate the computation that stat_summary() did with dplyr so you can plot the data “by hand”.

Comment: I think this question refers to the geom_line shown on page 229 using stat = "summary" and fun.y = "mean".

meanRelSalesData <- txhousing %>%
  group_by(date) %>%
  summarize(meanRelSales = mean(rel_sales, na.rm=TRUE))
meanRelSalesData
## # A tibble: 187 x 2
##        date meanRelSales
##       <dbl>        <dbl>
##  1 2000.000   -0.2625059
##  2 2000.083   -0.1535309
##  3 2000.167   -0.1775694
##  4 2000.250   -0.3273133
##  5 2000.333   -0.2263179
##  6 2000.417   -0.2185162
##  7 2000.500   -0.3047979
##  8 2000.583   -0.1751718
##  9 2000.667   -0.2235483
## 10 2000.750   -0.2263248
## # ... with 177 more rows
ggplot(txhousing, aes(date,rel_sales)) + 
  geom_line(aes(group=city), alpha = 1/5) +
  geom_line(aes(date,meanRelSales),data=meanRelSalesData,color="red")
## Warning: Removed 430 rows containing missing values (geom_path).

11.5.1 Exercises

  1. Do your conclusions change if you use a different measurement of model fit like AIC or deviance? Why/ why not?
#install.packages("broom")
library(broom)
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
## 
## # A tibble: 46 x 2
##                     city      mod
##  *                 <chr>   <list>
##  1               Abilene <S3: lm>
##  2              Amarillo <S3: lm>
##  3             Arlington <S3: lm>
##  4                Austin <S3: lm>
##  5              Bay Area <S3: lm>
##  6              Beaumont <S3: lm>
##  7       Brazoria County <S3: lm>
##  8           Brownsville <S3: lm>
##  9 Bryan-College Station <S3: lm>
## 10         Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups:   city [46]
##                     city r.squared adj.r.squared     sigma statistic
##                    <chr>     <dbl>         <dbl>     <dbl>     <dbl>
##  1               Abilene 0.5298893     0.5003394 0.2817299 17.932066
##  2              Amarillo 0.4493699     0.4147588 0.3015624 12.983427
##  3             Arlington 0.5132337     0.4826369 0.2673618 16.774130
##  4                Austin 0.4873283     0.4551032 0.3102974 15.122641
##  5              Bay Area 0.5552242     0.5272669 0.2646016 19.859696
##  6              Beaumont 0.4304326     0.3946312 0.2754690 12.022792
##  7       Brazoria County 0.5082062     0.4746054 0.2919275 15.124817
##  8           Brownsville 0.1714455     0.1187629 0.4196615  3.254307
##  9 Bryan-College Station 0.6511249     0.6291956 0.4061146 29.692015
## 10         Collin County 0.6009165     0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(AIC, reorder(city,AIC))) +
  geom_point()

Looking at AIC: Harlingen, San Marcos, and Brownsville have the highest AIC (two of the three are among the cities that had the lowest \(R^2\)). Dallas, El Paso, and Midland have the lowest AIC, which differ from the cities with the highest \(R^2\). (Recall that lower AIC is better.)
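To see directly how far the two criteria agree, AIC can be plotted against \(R^2\) per city (a sketch, assuming model_sum from above; check_overlap thins overlapping labels):

```r
library(ggplot2)

# Do the AIC and R^2 rankings tell the same story?
ggplot(model_sum, aes(r.squared, AIC)) +
  geom_point() +
  geom_text(aes(label = city), check_overlap = TRUE, size = 3, vjust = -0.5)
```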

worstAIC <- c("Harlingen", "San Marcos", "Brownsville")
bestAIC <- c("Dallas", "El Paso", "Midland")
extreme <- txhousing %>% ungroup() %>%
  filter(city %in% c(worstAIC,bestAIC), !is.na(sales)) %>%
  mutate(city = factor(city, c(worstAIC, bestAIC)))

ggplot(extreme, aes(month, log(sales))) +
  geom_line(aes(group=year)) +
  facet_wrap(~city)

  2. One possible hypothesis that explains why McAllen, Harlingen and Brownsville have lower \(R^2\) is that they’re smaller towns so there are fewer sales and more noise. Confirm or refute this hypothesis.

Looking at the plot on page 234, McAllen (low \(R^2\)) and Bryan-College Station (high \(R^2\)) have sales of a similar order of magnitude. This is also apparent in a boxplot of their sales.

txhousing %>% ungroup() %>%
  group_by(city) %>%
  filter(city %in% c("McAllen","Bryan-College Station"), !is.na(sales)) %>%
  summarize(meanSales=mean(sales))
## # A tibble: 2 x 2
##                    city meanSales
##                   <chr>     <dbl>
## 1 Bryan-College Station  186.7433
## 2               McAllen  162.1730
selectCities <- txhousing %>%
  filter(city %in% c("McAllen","Bryan-College Station"), !is.na(sales)) 

ggplot(selectCities, aes(city,sales)) + geom_boxplot()

If we look at all of the extreme cities, however, we see that NE Tarrant County (high \(R^2\)) does have significantly more sales, and Brownsville and Harlingen (both low \(R^2\)) have fewer sales than the others.

selectCities <- txhousing %>%
  filter(city %in% c("McAllen","Bryan-College Station", "Lubbock", "NE Tarrant County", "Brownsville", "Harlingen"), !is.na(sales)) 

ggplot(selectCities, aes(city,sales)) + geom_boxplot()

  3. McAllen, Harlingen and Brownsville seem to have much more year-to-year variation than Bryan-College Station, Lubbock, and NE Tarrant County. How does the model change if you also include a linear trend for year? (i.e. log(sales) ~ factor(month) + year).
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month) + year, data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
## 
## # A tibble: 46 x 2
##                     city      mod
##  *                 <chr>   <list>
##  1               Abilene <S3: lm>
##  2              Amarillo <S3: lm>
##  3             Arlington <S3: lm>
##  4                Austin <S3: lm>
##  5              Bay Area <S3: lm>
##  6              Beaumont <S3: lm>
##  7       Brazoria County <S3: lm>
##  8           Brownsville <S3: lm>
##  9 Bryan-College Station <S3: lm>
## 10         Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups:   city [46]
##                     city r.squared adj.r.squared     sigma statistic
##                    <chr>     <dbl>         <dbl>     <dbl>     <dbl>
##  1               Abilene 0.6843255     0.6625548 0.2315246 31.433387
##  2              Amarillo 0.5313000     0.4989758 0.2790224 16.436631
##  3             Arlington 0.5773500     0.5482018 0.2498469 19.807350
##  4                Austin 0.6540551     0.6301968 0.2556268 27.414184
##  5              Bay Area 0.6608046     0.6374118 0.2317349 28.248216
##  6              Beaumont 0.5200054     0.4869023 0.2536079 15.708670
##  7       Brazoria County 0.5091646     0.4723519 0.2925529 13.831237
##  8           Brownsville 0.2355866     0.1822555 0.4042607  4.417429
##  9 Bryan-College Station 0.8510313     0.8407576 0.2661372 82.835859
## 10         Collin County 0.6889276     0.6674743 0.2353747 32.112942
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(r.squared, reorder(city,r.squared))) +
  geom_point()

Harlingen moved up 10 places in the \(R^2\) ranking after adding year to the model.
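The per-city gain in \(R^2\) from the year term can be made explicit by fitting both models and joining their summaries (a sketch using the same do()/glance() idiom as above):

```r
library(ggplot2)  # txhousing data
library(dplyr)
library(broom)

# Fit a per-city model for a given formula and collect glance() summaries:
fit_glance <- function(f) {
  txhousing %>%
    group_by(city) %>%
    do(mod = lm(f, data = ., na.action = na.exclude)) %>%
    glance(mod)
}

r2_month <- fit_glance(log2(sales) ~ factor(month)) %>%
  select(city, r2_month = r.squared)
r2_year <- fit_glance(log2(sales) ~ factor(month) + year) %>%
  select(city, r2_year = r.squared)

# Cities that benefit most from the linear year trend:
inner_join(r2_month, r2_year, by = "city") %>%
  mutate(gain = r2_year - r2_month) %>%
  arrange(desc(gain))
```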

  4. Create a faceted plot that shows the seasonal patterns for all cities. Order the facets by the \(R^2\) for the city.
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))

model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups:   city [46]
##                     city r.squared adj.r.squared     sigma statistic
##                    <chr>     <dbl>         <dbl>     <dbl>     <dbl>
##  1               Abilene 0.5298893     0.5003394 0.2817299 17.932066
##  2              Amarillo 0.4493699     0.4147588 0.3015624 12.983427
##  3             Arlington 0.5132337     0.4826369 0.2673618 16.774130
##  4                Austin 0.4873283     0.4551032 0.3102974 15.122641
##  5              Bay Area 0.5552242     0.5272669 0.2646016 19.859696
##  6              Beaumont 0.4304326     0.3946312 0.2754690 12.022792
##  7       Brazoria County 0.5082062     0.4746054 0.2919275 15.124817
##  8           Brownsville 0.1714455     0.1187629 0.4196615  3.254307
##  9 Bryan-College Station 0.6511249     0.6291956 0.4061146 29.692015
## 10         Collin County 0.6009165     0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
#Order the facets by r.squared: facet_wrap() follows factor-level order,
#so sorting rows alone is not enough -- we must reorder the factor levels.
cityOrder <- model_sum %>% arrange(r.squared) %>% pull(city)
txhousingSorted <- txhousing %>%
  ungroup() %>%
  mutate(city = factor(city, levels = cityOrder))

ggplot(txhousingSorted, aes(month, log(sales))) +
  geom_line(aes(group = year)) +
  facet_wrap(~city)

11.6.1 Exercises

  1. Pull out the three cities with the highest and lowest seasonal effect. Plot their coefficients.
coefs <- models %>% tidy(mod)
coefs
## # A tibble: 552 x 6
## # Groups:   city [46]
##       city            term  estimate  std.error statistic       p.value
##      <chr>           <chr>     <dbl>      <dbl>     <dbl>         <dbl>
##  1 Abilene     (Intercept) 6.5420182 0.07043248 92.883541 7.898616e-151
##  2 Abilene  factor(month)2 0.3537962 0.09960657  3.551937  4.914925e-04
##  3 Abilene  factor(month)3 0.6747930 0.09960657  6.774584  1.826729e-10
##  4 Abilene  factor(month)4 0.7489369 0.09960657  7.518950  2.758003e-12
##  5 Abilene  factor(month)5 0.9164846 0.09960657  9.201046  1.056158e-16
##  6 Abilene  factor(month)6 1.0024412 0.09960657 10.064007  4.372570e-19
##  7 Abilene  factor(month)7 0.9539842 0.09960657  9.577523  9.810940e-18
##  8 Abilene  factor(month)8 0.9337577 0.10125307  9.222019  9.259347e-17
##  9 Abilene  factor(month)9 0.6036668 0.10125307  5.961961  1.342427e-08
## 10 Abilene factor(month)10 0.6084391 0.10125307  6.009093  1.055654e-08
## # ... with 542 more rows
#Pull out months:
months <- coefs %>%
  filter(grepl("factor",term)) %>%
  tidyr::extract(term,"month","(\\d+)", convert=TRUE)
months
## # A tibble: 506 x 6
## # Groups:   city [46]
##       city month  estimate  std.error statistic      p.value
##      <chr> <int>     <dbl>      <dbl>     <dbl>        <dbl>
##  1 Abilene     2 0.3537962 0.09960657  3.551937 4.914925e-04
##  2 Abilene     3 0.6747930 0.09960657  6.774584 1.826729e-10
##  3 Abilene     4 0.7489369 0.09960657  7.518950 2.758003e-12
##  4 Abilene     5 0.9164846 0.09960657  9.201046 1.056158e-16
##  5 Abilene     6 1.0024412 0.09960657 10.064007 4.372570e-19
##  6 Abilene     7 0.9539842 0.09960657  9.577523 9.810940e-18
##  7 Abilene     8 0.9337577 0.10125307  9.222019 9.259347e-17
##  8 Abilene     9 0.6036668 0.10125307  5.961961 1.342427e-08
##  9 Abilene    10 0.6084391 0.10125307  6.009093 1.055654e-08
## 10 Abilene    11 0.4189183 0.10125307  4.137339 5.449232e-05
## # ... with 496 more rows
#Strength of seasonal effect:
coefSummary <- months %>%
  group_by(city) %>%
  summarize(max=max(estimate))

#Extract highest and lowest seasonal effect:
top3 <- coefSummary %>% top_n(3,max)
bottom3 <- coefSummary %>% top_n(-3,max)
selectCitySummary <- rbind(top3,bottom3)

#Plot their coefficients:
ggplot(selectCitySummary, aes(2^max, reorder(city,max))) +
  geom_point() + 
  ggtitle("Maximum 2^coefficients for Cities \nwith highest and lowest seasonal effect")

  2. How does strength of seasonal effect relate to the \(R^2\) for the model? Answer with a plot.

There seems to be a positive relationship between \(R^2\) and seasonal effect:

#model_sum contains r.squared for each city 
#coefSummary contains max coef (seasonal effect)

combined <- inner_join(model_sum,coefSummary,by="city")
combined
## # A tibble: 46 x 13
## # Groups:   city [?]
##                     city r.squared adj.r.squared     sigma statistic
##                    <chr>     <dbl>         <dbl>     <dbl>     <dbl>
##  1               Abilene 0.5298893     0.5003394 0.2817299 17.932066
##  2              Amarillo 0.4493699     0.4147588 0.3015624 12.983427
##  3             Arlington 0.5132337     0.4826369 0.2673618 16.774130
##  4                Austin 0.4873283     0.4551032 0.3102974 15.122641
##  5              Bay Area 0.5552242     0.5272669 0.2646016 19.859696
##  6              Beaumont 0.4304326     0.3946312 0.2754690 12.022792
##  7       Brazoria County 0.5082062     0.4746054 0.2919275 15.124817
##  8           Brownsville 0.1714455     0.1187629 0.4196615  3.254307
##  9 Bryan-College Station 0.6511249     0.6291956 0.4061146 29.692015
## 10         Collin County 0.6009165     0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 8 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
## #   max <dbl>
ggplot(combined, aes(r.squared,2^max)) + 
  geom_point() +
  labs(y="Max 2^Coef (Seasonal effect)")

  3. You should be extra cautious when your results agree with your prior beliefs. How can you confirm or refute my hypothesis about the causes of strong seasonal patterns?

Hypothesis: the cities with the strongest seasonal effect are college and beach towns.

It is often difficult to draw conclusions about cause. In this case we could run a hypothesis test comparing the seasonal effect of beach and college towns against all other towns, to see whether the difference is statistically significant.
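A sketch of such a test, using coefSummary from above and a hand-made, entirely hypothetical town classification (the labels below are illustrative assumptions, not established facts):

```r
library(dplyr)

# Hypothetical labels -- which towns count as college/beach is an assumption:
college_beach <- c("Bryan-College Station", "Lubbock",
                   "Galveston", "South Padre Island")

effect_groups <- coefSummary %>%
  mutate(group = ifelse(city %in% college_beach, "college/beach", "other"))

# Two-sample t-test on the strength of the seasonal effect:
t.test(max ~ group, data = effect_groups)
```

With only a handful of labelled towns the test will have little power; it is the shape of the analysis, not a definitive answer.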

  4. Group the diamonds data by cut, clarity, and color. Fit a linear model log(price) ~ log(carat). What does the intercept tell you? What does the slope tell you? How do the slope and intercept vary across the groups? Answer with a plot.

The slope tells us how log(price) changes as log(carat) changes (within each combination of cut, clarity, and color). The intercept tells us the predicted log(price) when log(carat) = 0, i.e. the price of a one-carat diamond in that group.

Next we’ll look at how the slope and intercept vary across groups.

models <- diamonds %>%
  group_by(cut,clarity,color) %>%
  do(mod = lm(
    log(price) ~ log(carat),
    data = .,
    na.action = na.exclude
  ))
models
## Source: local data frame [276 x 4]
## Groups: <by row>
## 
## # A tibble: 276 x 4
##      cut clarity color      mod
##  * <ord>   <ord> <ord>   <list>
##  1  Fair      I1     D <S3: lm>
##  2  Fair      I1     E <S3: lm>
##  3  Fair      I1     F <S3: lm>
##  4  Fair      I1     G <S3: lm>
##  5  Fair      I1     H <S3: lm>
##  6  Fair      I1     I <S3: lm>
##  7  Fair      I1     J <S3: lm>
##  8  Fair     SI2     D <S3: lm>
##  9  Fair     SI2     E <S3: lm>
## 10  Fair     SI2     F <S3: lm>
## # ... with 266 more rows
models %>% glance(mod) 
## # A tibble: 276 x 14
## # Groups:   cut, clarity, color [276]
##      cut clarity color r.squared adj.r.squared      sigma   statistic
##    <ord>   <ord> <ord>     <dbl>         <dbl>      <dbl>       <dbl>
##  1  Fair      I1     D 0.9937512     0.9906267 0.07378993  318.059271
##  2  Fair      I1     E 0.4701053     0.3944061 0.30213442    6.210172
##  3  Fair      I1     F 0.8560433     0.8516810 0.32529833  196.235634
##  4  Fair      I1     G 0.9738591     0.9733466 0.13199534 1899.969075
##  5  Fair      I1     H 0.9237048     0.9221789 0.19958859  605.349118
##  6  Fair      I1     I 0.9469761     0.9453191 0.13809815  571.501803
##  7  Fair      I1     J 0.9626127     0.9608324 0.15083441  540.688367
##  8  Fair     SI2     D 0.9133422     0.9117374 0.19026661  569.140643
##  9  Fair     SI2     E 0.9578375     0.9572827 0.13575207 1726.549900
## 10  Fair     SI2     F 0.9313385     0.9305493 0.18134425 1180.085357
## # ... with 266 more rows, and 7 more variables: p.value <dbl>, df <int>,
## #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
slope <- models %>% tidy(mod) %>% filter(grepl("carat",term))
slope
## # A tibble: 271 x 8
## # Groups:   cut, clarity, color [271]
##      cut clarity color       term estimate  std.error statistic
##    <ord>   <ord> <ord>      <chr>    <dbl>      <dbl>     <dbl>
##  1  Fair      I1     D log(carat) 1.398200 0.07839988 17.834216
##  2  Fair      I1     E log(carat) 1.242414 0.49855648  2.492022
##  3  Fair      I1     F log(carat) 1.496369 0.10681933 14.008413
##  4  Fair      I1     G log(carat) 1.623653 0.03724947 43.588635
##  5  Fair      I1     H log(carat) 1.603171 0.06515939 24.603844
##  6  Fair      I1     I log(carat) 1.652808 0.06913749 23.906104
##  7  Fair      I1     J log(carat) 1.397439 0.06009792 23.252707
##  8  Fair     SI2     D log(carat) 1.783365 0.07475331 23.856669
##  9  Fair     SI2     E log(carat) 1.826243 0.04395103 41.551774
## 10  Fair     SI2     F log(carat) 1.659332 0.04830327 34.352370
## # ... with 261 more rows, and 1 more variables: p.value <dbl>
intercept <- models %>% tidy(mod) %>% filter(grepl("Intercept",term))
intercept
## # A tibble: 276 x 8
## # Groups:   cut, clarity, color [276]
##      cut clarity color        term estimate  std.error statistic
##    <ord>   <ord> <ord>       <chr>    <dbl>      <dbl>     <dbl>
##  1  Fair      I1     D (Intercept) 7.962306 0.05477466  145.3648
##  2  Fair      I1     E (Intercept) 7.644798 0.10398708   73.5168
##  3  Fair      I1     F (Intercept) 7.654878 0.05625926  136.0643
##  4  Fair      I1     G (Intercept) 7.613672 0.01838181  414.1959
##  5  Fair      I1     H (Intercept) 7.605013 0.03418893  222.4408
##  6  Fair      I1     I (Intercept) 7.630474 0.02809506  271.5949
##  7  Fair      I1     J (Intercept) 7.624676 0.04572844  166.7382
##  8  Fair     SI2     D (Intercept) 8.243521 0.02560794  321.9127
##  9  Fair     SI2     E (Intercept) 8.206757 0.01550199  529.4001
## 10  Fair     SI2     F (Intercept) 8.174321 0.01922257  425.2459
## # ... with 266 more rows, and 1 more variables: p.value <dbl>
#Plot slope estimates:
ggplot(slope,aes(clarity,estimate,color=color)) +
  geom_point() +
  facet_wrap(~cut) +
  labs(title="Slope") + ylab("Estimate of slope")

#Plot intercept estimates:
ggplot(intercept, aes(clarity, estimate, color = color)) +
  geom_point() +
  facet_wrap(~cut) +
  labs(title = "Intercept") + ylab("Estimate of intercept")

11.7.1 Exercises

  1. A common diagnostic plot is fitted values (.fitted) vs. residuals (.resid). Do you see any patterns? What if you include the city or month on the same plot?
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))

obsSummary <- models %>% augment(mod)
## Warning in augment_columns(x, data, newdata, type.predict = type.predict, :
## When fitting with na.exclude, rows with NA in original data will be dropped
## unless those rows are provided in 'data' argument

## (this warning is repeated once for each of the 46 city models)
obsSummary
## # A tibble: 8,034 x 13
## # Groups:   city [46]
##       city log2.sales. factor.month.  .fitted    .se.fit     .resid
##      <chr>       <dbl>        <fctr>    <dbl>      <dbl>      <dbl>
##  1 Abilene    6.169925             1 6.542018 0.07043248 -0.3720932
##  2 Abilene    6.614710             2 6.895814 0.07043248 -0.2811046
##  3 Abilene    7.022368             3 7.216811 0.07043248 -0.1944434
##  4 Abilene    6.614710             4 7.290955 0.07043248 -0.6762452
##  5 Abilene    7.139551             5 7.458503 0.07043248 -0.3189515
##  6 Abilene    7.285402             6 7.544459 0.07043248 -0.2590572
##  7 Abilene    7.247928             7 7.496002 0.07043248 -0.2480749
##  8 Abilene    7.033423             8 7.475776 0.07274235 -0.4423529
##  9 Abilene    6.700440             9 7.145685 0.07274235 -0.4452453
## 10 Abilene    6.658211            10 7.150457 0.07274235 -0.4922458
## # ... with 8,024 more rows, and 7 more variables: .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, .rownames <chr>,
## #   `log2(sales)` <dbl>, `factor(month)` <fctr>
p <- ggplot(obsSummary, aes(.fitted,.resid)) + 
  geom_point() 
p

In the plot above, the spread of the residuals shrinks as the fitted values increase. The pattern is easier to read when we facet by month (the variation of the residuals appears largest in the month = NA panel):

p + facet_wrap(~factor.month.)

#p + facet_wrap(~city)
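As an alternative to facetting, month can be mapped to colour on a single panel (using the `factor.month.` column produced by `augment()` above); this is a sketch of one way to "include month on the same plot":

```r
# Colour the fitted-vs-residual scatter by month instead of facetting.
# Overplotting is heavy with ~8,000 points, so lower the alpha.
ggplot(obsSummary, aes(.fitted, .resid, colour = factor.month.)) +
  geom_point(alpha = 0.3)
```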
  2. Create a time series of log(sales) for each city. Highlight points that have a standardized residual of greater than 2.
largeResid <- obsSummary %>% filter(abs(.std.resid)>2)
largeResid
## # A tibble: 268 x 13
## # Groups:   city [43]
##         city log2.sales. factor.month.  .fitted    .se.fit     .resid
##        <chr>       <dbl>        <fctr>    <dbl>      <dbl>      <dbl>
##  1   Abilene    6.614710             4 7.290955 0.07043248 -0.6762452
##  2   Abilene    6.714246             4 7.290955 0.07043248 -0.5767095
##  3   Abilene    7.303781             1 6.542018 0.07043248  0.7617625
##  4   Abilene    8.066089             7 7.496002 0.07043248  0.5700868
##  5  Amarillo    6.672425             1 7.282885 0.07539060 -0.6104594
##  6  Amarillo    7.303781             9 7.891996 0.07786308 -0.5882150
##  7  Amarillo    6.066089            10 7.718550 0.07786308 -1.6524607
##  8  Amarillo    7.507795             7 8.092723 0.07539060 -0.5849281
##  9 Arlington    9.491853             6 8.958743 0.06684046  0.5331099
## 10 Arlington    9.481799             8 8.942010 0.06903253  0.5397894
## # ... with 258 more rows, and 7 more variables: .hat <dbl>, .sigma <dbl>,
## #   .cooksd <dbl>, .std.resid <dbl>, .rownames <chr>, `log2(sales)` <dbl>,
## #   `factor(month)` <fctr>
ggplot(txhousing,aes(date,log(sales))) + geom_line(aes(group=city))
## Warning: Removed 430 rows containing missing values (geom_path).

TODO: How to know which of the original data points map to the records in obsSummary?
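One possible approach (a sketch, not the book's solution): rather than collecting models with `do()` and `augment()`, compute the residuals group-wise with `mutate()`. Because `na.action = na.exclude` pads the output of `resid()` and `rstandard()` with `NA`s for the dropped rows, the residuals stay aligned with the original rows of `txhousing`, so `date` and `city` survive and the large-residual points can be highlighted directly:

```r
library(dplyr)
library(ggplot2)

# Refit the per-city model inside mutate(); na.exclude keeps the
# standardized residuals row-aligned with the input data.
txres <- txhousing %>%
  group_by(city) %>%
  mutate(
    std_resid = rstandard(
      lm(log2(sales) ~ factor(month), na.action = na.exclude)
    )
  ) %>%
  ungroup()

# Time series per city, with |standardized residual| > 2 marked in red
# (rows with NA residuals are dropped by the filter).
ggplot(txres, aes(date, log2(sales))) +
  geom_line(aes(group = city), alpha = 1/2) +
  geom_point(data = filter(txres, abs(std_resid) > 2), colour = "red")
```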